Introduction

According to the World Health Organization, about 800,000 people die by suicide every year. In this project, using a joint dataset from the United Nations Development Programme, World Bank, Kaggle, and World Health Organization, we examined current trends in suicide. In particular, we are interested in:

Data

The main dataset for our project combines summary datasets from the United Nations Development Programme, World Bank, Kaggle, and World Health Organization. It can be accessed here. The dataset covers 1985 to 2016. However, because there are very few observations for 2016, we keep only 1985 to 2015. The raw dataset has 27660 observations and 8 features. The basic features we are interested in include:

Besides those, we derive our main variable of interest, Suicides Per 100K, as Suicides_no divided by Population and multiplied by 100,000. A sample of the final dataset is shown below:

year  country              sex     age    population  gdp_per_capita  suicides_no  suicide_per_100k
1985  Antigua and Barbuda  female  15-24  7709        3850            0            0
1985  Antigua and Barbuda  female  25-34  6344        3850            0            0
1985  Antigua and Barbuda  female  35-54  6173        3850            0            0
1985  Antigua and Barbuda  female  5-14   7339        3850            0            0
1985  Antigua and Barbuda  female  55-74  3778        3850            0            0
1985  Antigua and Barbuda  female  75+    949         3850            0            0
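
The derivation of Suicides Per 100K described above can be sketched in a few lines. This is a minimal Python illustration rather than the code used for the report (the models in this report were fit in R); the function name `suicides_per_100k` and the extra hypothetical group at the end are assumptions introduced only for the example.

```python
def suicides_per_100k(suicides_no: int, population: int) -> float:
    """Return suicides per 100,000 people for one country-year-sex-age group."""
    if population <= 0:
        raise ValueError("population must be positive")
    return suicides_no / population * 100_000


# Rows copied from the sample table: (age group, population, suicides_no).
sample_rows = [
    ("15-24", 7709, 0),
    ("25-34", 6344, 0),
    ("35-54", 6173, 0),
    ("5-14", 7339, 0),
    ("55-74", 3778, 0),
    ("75+", 949, 0),
]

# Every sample row has zero suicides, so every derived rate is zero.
rates = {age: suicides_per_100k(n, pop) for age, pop, n in sample_rows}

# Hypothetical group (NOT from the dataset) showing a non-zero rate:
# 12 suicides in a population of 300,000.
example_rate = suicides_per_100k(12, 300_000)
```

The sample rows all give a rate of zero, which matches the last column of the table; the hypothetical group works out to about 4 suicides per 100,000.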

Statistical Analysis

Regression analysis will be the main method in our study.

Results

Global Trend of Suicide Per 100k Population over time

Before 1995, the global suicide rate was increasing; since then, it has been decreasing. The rate peaked in 1995 over these 30 years.

Global Trend of Suicide Per 100k Population by gender over time

Surprisingly, we found that males have had a higher suicide rate than females since 1985. The female suicide rate has been very stable throughout the period, while the male rate has changed dramatically. As in the overall trend, the suicide rate peaks in 1995.

Global Trend of Suicide Per 100k Population by age over time

The suicide rate for the youngest age group is low and nearly constant over time. As the graph shows, older groups have had higher suicide rates since 1985, and surprisingly this ordering has not changed once. Age groups 25-34, 35-54, and 55-74 had similar trends over these 30 years.

Suicide Rate by Country GDP

GDP has been viewed as a good measure of a country's development. However, the graph below shows no obvious trend between GDP and suicide rate. Although GDPs across the world have shifted upward over time, this lack of a trend persists.

After seeing the trends, we researched the year 1995 and …

Countries with most suicides across the years

We can see that the United States, the Russian Federation, and Japan are the top three countries by number of suicides per year. The US, Japan, and Germany (ranked fourth) also rank among the highest in GDP. Our research suggests that the Russian Federation's high suicide rate might be due to heavy alcohol use, with an estimated half of all suicides correlated with alcohol abuse.

Inference

From the distribution plot of suicide_100k_pop, we can see that the variable needs to be transformed to satisfy the assumptions of the linear model. We used a log transformation, replacing 0's with 0.01 so that the logarithm is defined. The following graphs show that, after transformation, the distribution is much closer to normal than before.
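
The transformation step can be illustrated with a short sketch. It is written in Python for illustration only and assumes a natural logarithm (the default of `log` in R, where the models were fit); the helper name `log_rate` and the example rates are hypothetical and are not taken from the dataset.

```python
import math

# Value substituted for zero rates before taking the logarithm,
# as described in the text.
ZERO_REPLACEMENT = 0.01


def log_rate(rate_per_100k: float) -> float:
    """Natural log of a suicide rate, with zeros replaced by 0.01."""
    if rate_per_100k < 0:
        raise ValueError("a rate cannot be negative")
    adjusted = rate_per_100k if rate_per_100k > 0 else ZERO_REPLACEMENT
    return math.log(adjusted)


# Hypothetical rates per 100k (illustrative values, not from the dataset).
example_rates = [0.0, 0.5, 4.0, 25.0]
log_rates = [log_rate(r) for r in example_rates]
```

Because 0.01 is smaller than any typical non-zero rate, groups with no suicides end up well to the left of the rest on the log scale, which is worth keeping in mind when reading residual plots.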

We chose year, sex, age, \(sex*age\), and gdp_per_capita as predictors of the log suicide rate. We selected these variables because sex, age, and GDP were our main interests from the outset. We included the interaction term \(sex*age\) because we think age and sex might interact, with one acting as an effect modifier or confounder of the other. For example, women in menopause might have a higher suicide rate because of hormonal fluctuation. From the diagnostic plots, we can see that the residual variance is mostly constant, although there are three clusters (one on the left and two on the right); the normality assumption approximately holds; and there are about three outliers.

## 
## Call:
## lm(formula = log_suicide ~ year + sex * age + gdp_per_capita, 
##     data = maindata_log_y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7101 -0.6122  0.1882  0.6922  3.2761 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.041e+01  1.614e+00  12.649  < 2e-16 ***
## year             -9.692e-03  8.074e-04 -12.005  < 2e-16 ***
## sexmale           1.097e+00  3.133e-02  35.021  < 2e-16 ***
## age25-34          8.270e-02  3.133e-02   2.640 0.008300 ** 
## age35-54          2.699e-01  3.133e-02   8.616  < 2e-16 ***
## age5-14          -1.625e+00  3.133e-02 -51.873  < 2e-16 ***
## age55-74          3.531e-01  3.133e-02  11.272  < 2e-16 ***
## age75+            4.404e-01  3.133e-02  14.057  < 2e-16 ***
## gdp_per_capita    6.678e-06  3.604e-07  18.530  < 2e-16 ***
## sexmale:age25-34  2.767e-01  4.431e-02   6.246 4.26e-10 ***
## sexmale:age35-54  2.562e-01  4.431e-02   5.783 7.42e-09 ***
## sexmale:age5-14  -8.449e-01  4.431e-02 -19.070  < 2e-16 ***
## sexmale:age55-74  1.647e-01  4.431e-02   3.718 0.000201 ***
## sexmale:age75+    2.615e-01  4.431e-02   5.903 3.60e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.064 on 27646 degrees of freedom
## Multiple R-squared:  0.5109, Adjusted R-squared:  0.5107 
## F-statistic:  2221 on 13 and 27646 DF,  p-value: < 2.2e-16
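
Because the response is on the log scale, the coefficients are easier to read after exponentiation, which turns them into multiplicative effects on the suicide rate. The sketch below is a hedged Python illustration using estimates copied from the output above. It assumes a natural log response and that the reference levels are female and age 15-24 (inferred from the levels that have no coefficient row); the helper name `rate_ratio` is hypothetical.

```python
import math

# Estimates copied from the regression output above.
coef = {
    "year": -9.692e-03,
    "sexmale": 1.097,
    "gdp_per_capita": 6.678e-06,
    "sexmale:age75+": 0.2615,
}


def rate_ratio(log_effect: float) -> float:
    """Convert an effect on the log scale into a multiplicative rate ratio."""
    return math.exp(log_effect)


# Male vs. female in the assumed reference age group (15-24).
male_vs_female_15_24 = rate_ratio(coef["sexmale"])

# Male vs. female at age 75+: main effect plus the interaction term.
male_vs_female_75_plus = rate_ratio(coef["sexmale"] + coef["sexmale:age75+"])

# Percentage change in the rate per additional calendar year.
yearly_change_pct = (rate_ratio(coef["year"]) - 1) * 100

# Rate ratio for a 10,000 increase in gdp_per_capita.
per_10k_gdp = rate_ratio(coef["gdp_per_capita"] * 10_000)
```

Under these assumptions, the fitted model implies a male rate roughly three times the female rate at age 15-24, a somewhat larger ratio at age 75+, and a decline of a little under 1% per year, holding the other predictors fixed.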

Prediction

## Analysis of Variance Table
## 
## Response: log_suicide
##                   Df  Sum Sq Mean Sq   F value    Pr(>F)    
## country           99 15390.2   155.5   264.545 < 2.2e-16 ***
## sex                1  8615.5  8615.5 14661.236 < 2.2e-16 ***
## age                5 22524.9  4505.0  7666.255 < 2.2e-16 ***
## population         1    19.1    19.1    32.496 1.207e-08 ***
## gdp_per_capita     1    96.9    96.9   164.853 < 2.2e-16 ***
## sex:age            5  1103.2   220.6   375.467 < 2.2e-16 ***
## Residuals      27547 16187.6     0.6                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 0.7457909

For prediction, we used adjusted R squared as our model selection criterion. We therefore built a larger model with country, sex, age, population, gdp_per_capita, and \(sex*age\) as predictors. The adjusted R squared for this model is 0.746, which means that about 74.6% of the variability in the log suicide rate is explained by the model. However, because this model includes more variables, it may be harder to interpret. From the diagnostic plots, we can see that the residual variance is mostly constant and better behaved than in the smaller model; the normality assumption approximately holds; and there are about three outliers.
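
As a consistency check, the adjusted R squared can be recomputed from the degrees of freedom and sums of squares in the analysis of variance table. The Python sketch below is illustrative only and uses the rounded values printed in that table, so the result agrees with the reported 0.7457909 only up to rounding.

```python
# (degrees of freedom, sum of squares) copied from the ANOVA table above.
anova = {
    "country": (99, 15390.2),
    "sex": (1, 8615.5),
    "age": (5, 22524.9),
    "population": (1, 19.1),
    "gdp_per_capita": (1, 96.9),
    "sex:age": (5, 1103.2),
    "Residuals": (27547, 16187.6),
}

df_resid, ss_resid = anova["Residuals"]

# Totals over all rows: n - 1 degrees of freedom and the total sum of squares.
df_total = sum(df for df, _ in anova.values())
ss_total = sum(ss for _, ss in anova.values())

# R squared: share of the total sum of squares not left in the residuals.
r_squared = 1 - ss_resid / ss_total

# Adjusted R squared: the same ratio using mean squares, which penalizes
# the number of parameters in the model.
adj_r_squared = 1 - (ss_resid / df_resid) / (ss_total / df_total)
```

The total degrees of freedom come to 27659, one less than the 27660 observations, and the adjusted value sits slightly below the unadjusted one because the model spends 112 degrees of freedom on its predictors.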

Discussion

Strength:

This is a longitudinal study with a time span of 30 years, which gives us more opportunity to explore the trends in the data and the reasons behind them.

Limitations:

First, the dataset contains only a few variables that may be related to the suicide rate. Other factors, including latent ones such as alcohol use, may also be reflected in the suicide rate.

Second, a global trend analysis needs data from all over the world. However, this dataset lacks suicide information for many countries in Asia and Africa.

Lastly, the data are not at the individual level, which may limit our analysis for prediction purposes.